# setting options for our code chunks
knitr::opts_chunk$set(message = FALSE, warning = FALSE, tidy = TRUE)
The purpose of this project is to create a model for predicting the rating of an Airbnb based on some of its characteristics.
Airbnb is a vacation rental company that allows users to book privately owned residences to stay at overnight. The type of residence can vary from booking to booking: some are homes, others are rooms, some are apartments, and others offer unique experiences such as tents or RVs. These residences are often collectively referred to as ‘Airbnbs’.
Take a look at the website here.
The website itself is easy to navigate. Users can input the locations they are looking for, sort bookings by price, or even specify the type of experience they would like. Once a user picks a specific Airbnb they are interested in, they are able to look at more specific information about the Airbnb including its rating, reviews, amenities, number of rooms, type of rooms, and of course the dates it’s available to book.
Like most other products, Airbnbs with better ratings tend to attract more interest from consumers! If a host is looking to upload a new Airbnb, can they get a glimpse into how satisfied customers will be with their stay? Our model can help them make changes to the property that might ensure a better stay for their visitors.
First, we will load in our packages and data. The data comes from this Kaggle dataset that was updated about a year ago. It contains data on over 250,000 Airbnbs from ten cities (and additionally a separate dataset containing reviews for Airbnbs).
The full codebook is available to look at in the data folder, but some of the variables we will be looking at include:
listing_id - the Listing ID
host_id - the host ID
host_since - the date the Host joined Airbnb
host_response_rate - percentage of times the Host responds
host_acceptance_rate - percentage of times the Host accepts a booking request
host_is_superhost - a binary field to determine if the Host is a Superhost
host_total_listings_count - total listings the Host has in Airbnb
host_has_profile_pic - a binary field to determine if the Host has a profile picture
host_identity_verified - a binary field to determine if the Host has a verified identity
property_type - the Listing’s property type
room_type - the Listing’s room type
amenities - a list of amenities the Listing includes
review_scores_rating - the Listing’s overall rating out of 100
# loading in our packages
library(tidymodels)
library(tidyverse)
library(ggplot2)
library(maps)
library(corrplot)
library(forcats)
library(corrr)
library(yardstick)
library(sf)
library(mapview)
library(dplyr)
library(stringr)
library(janitor)
library(lubridate)
library(MASS)
library(knitr)
tidymodels_prefer()
listings <- read.csv('/Users/Sofia/Desktop/PSTAT131/Final Project/Airbnb Data/Listings.csv')
We have a grand total of 279,712 observations and 33 variables! This is a lot of data to wrangle, but we will do our best to clean out as much of it as we can.
Next, we will transform our data so that it is easier to work with when we begin analyzing it!
We’ll start off by removing some of the more inconsistent variables
like the name of the Airbnb, host_location,
neighbourhood, and district.
complete_listings <- listings %>%
  select(-name, -host_location, -neighbourhood, -district)
This brings us down to 29 variables! Looking at our observations, we can see that many of them have one or more missing values. We will remove any incomplete cases from the dataset:
complete_listings <- complete_listings[complete.cases(complete_listings), ]
This narrows down our dataset to 93,324 observations which is about a third of the data we had started with. This is still plenty of data to work with!
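As a small illustration of what complete.cases() does, the sketch below uses a made-up three-row data frame (the values are hypothetical, not taken from the Listings data). One caveat that may be worth checking on the real file: with its default settings, read.csv() turns blank numeric fields into NA but leaves blank text fields as empty strings, so rows with empty text fields would still count as complete.

```r
# Hypothetical toy data: NA marks a missing value
toy <- data.frame(
  listing_id = c(1, 2, 3),
  bedrooms = c(2, NA, 1),
  review_scores_rating = c(95, 88, NA)
)

# complete.cases() returns TRUE only for rows with no NA in any column
keep <- complete.cases(toy)
toy_complete <- toy[keep, ]

nrow(toy_complete) # number of rows that survive
mean(keep)         # share of rows kept
```

Printing mean(keep) on the real data is a quick way to see how much of the dataset the filter removes.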
One of our variables is called amenities and contains a
list of amenities for each entry: this is likely very valuable
information, but we cannot work with the data as it is. Instead, we
will make the most common amenities into dummy variables for each
Airbnb.
We start off by creating a separate dataframe with just
listing_id and amenities:
amenities_list <- complete_listings %>% select(c("listing_id", "amenities"))
We then apply a function that cleans our text data and unlists each value in the list of amenities. We get a resulting dataframe with two columns for listing IDs and each corresponding amenity as a separate entity. There are over two million amenities in our 93,324 Airbnbs!
amenitiesFunction <- function(n){
  output <- n %>%
    str_to_lower() %>% # make all amenities lowercase
    str_split(",") %>% # split each list at a comma
    unlist() %>% # unlist the amenities
    str_replace_all("[[:punct:]]", "") %>% # get rid of any punctuation
    str_trim() %>% # trim white spaces in amenities
    tibble() # return as a tibble
  return(output)
}
# new column with our cleaned amenities
amenities_list$amenities_new <- map(.x = amenities_list$amenities,
                                    .f = amenitiesFunction)
# unnest to see each amenity with the corresponding listing
amenities_list <- unnest(amenities_list, cols = amenities_new)
# remove the old column and rename the new amenities
amenities_list <- amenities_list %>% select(!amenities)
colnames(amenities_list)[2] <- 'new_amenities'
# save our file
save(amenities_list, file = 'bigFiles/amenities_list')
Since there is a lot of variation among the amenities, our next step
is to filter out any infrequent ones, categorize them as Other,
and reshape the result to fit the rest of our listing data:
# load file
load(file = 'bigFiles/amenities_list')
amenities_list2 <- amenities_list %>%
  mutate(new_amenities = fct_lump(new_amenities, prop = .01)) %>% # classify others
  mutate(row = row_number()) %>% # mutate a column with row counts
  pivot_wider(names_from = new_amenities, values_from = row) # pivot wider
Unfortunately, because there are multiple counts of
Other in some of the entries, pivot_wider
returns lists of values. To work around this, I created a new data frame that
checks the length of each list, which tells us whether a listing has or
does not have a given amenity:
amenities_list3 <- tibble(.rows = 93324) # empty tibble with one row per listing
amenities_list3$listing_id <- amenities_list2$listing_id # add in listing_id
# loop over the amenity columns and check whether each list entry is empty
for (n in amenities_list2[2:38]) {
  n1 <- ifelse(lengths(n) == 0, 'f', 't') # 'f' if the amenity is absent, 't' otherwise
  amenities_list3[ , ncol(amenities_list3) + 1] <- n1 # append as a new column
}
# append correct column names
colnames(amenities_list3) <- colnames(amenities_list2)
# save file
write_rds(amenities_list3, file = 'bigFiles/amenities_list_final')
PHEW! That was a doozy. Now, let’s append these amenities to our original data frame:
# load file
amenities_final <- read_rds(file = 'bigFiles/amenities_list_final')
# bind amenities tibble and remove old amenities variable
complete_listings <- complete_listings %>%
  cbind(amenities_final[2:38]) %>%
  select(-amenities)
head(complete_listings) %>% as_tibble()
That’s really nice to look at! However, we still have some things to take care of…
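To make the amenity steps easier to follow, here is a minimal base-R sketch that traces the same idea (clean, go long, go wide, recode as t/f) on two made-up listings. The raw strings, the listing IDs, and the helper name clean_amenities are all hypothetical, and base functions stand in for the stringr and tidyr calls used above, so this is an illustration rather than the code that produced our data.

```r
# Hypothetical raw amenity strings, keyed by made-up listing IDs
raw <- c("101" = '["Wifi", "Hot water", "Kitchen"]',
         "102" = '["Wifi", "Heating"]')

# Base-R stand-in for the stringr steps: lowercase, split, strip punctuation, trim
clean_amenities <- function(x) {
  parts <- strsplit(tolower(x), ",", fixed = TRUE)[[1]]
  parts <- gsub("[[:punct:]]", "", parts)
  trimws(parts)
}

cleaned <- lapply(raw, clean_amenities)

# Long format: one row per (listing, amenity) pair
long <- data.frame(
  listing_id = rep(names(cleaned), lengths(cleaned)),
  amenity = unlist(cleaned, use.names = FALSE)
)

# Wide format: count each pair, then recode presence as 't' / 'f'
counts <- unclass(table(long$listing_id, long$amenity))
flags <- counts
flags[] <- ifelse(counts > 0, "t", "f")
flags
```

In this sketch, pairs are counted before widening, so an amenity that appears more than once for a listing (as Other can after lumping) just produces a count above one instead of a list of values.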
We will also categorize rarer property types as Other in
our main dataframe:
complete_listings <- complete_listings %>%
  mutate(property_type = fct_lump(property_type, prop = .01))
Then, we will create a column that stores the date when the user became a Host as a proper Date, and another that contains only the year they became a Host, for simplicity:
complete_listings <- complete_listings %>%
  mutate(date = parse_date(host_since, "%Y-%m-%d"),
         host_since_year = year(date))
We will make the year into a factor as well:
complete_listings$host_since_year <- complete_listings$host_since_year %>%
  as.factor()
And then we will transform all the binary variables (stored as "t" or "f") into factors (including our amenities).
factorFunction <- function(n) {
  n <- factor(n, levels = c("t", "f"))
  return(n)
}
complete_listings[c(7,9,10,28:65)] <- lapply(complete_listings[c(7,9,10,28:65)], factorFunction)
We’re almost done with the pre-processing of our data! Finally, we’re going to clean our variable names to make them easier to work with:
complete_listings <- complete_listings %>% clean_names()
… and now we can begin work on our models!
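The factor conversion above picks columns by position, which depends on the column order staying fixed. As a hedged alternative, the sketch below finds the t/f columns by their contents instead. The toy data frame is hypothetical, and the approach assumes the binary columns contain only "t" and "f" with no blanks or NAs.

```r
# Hypothetical toy frame mixing t/f columns with other types
toy <- data.frame(
  host_is_superhost = c("t", "f", "t"),
  wifi = c("t", "t", "f"),
  city = c("Paris", "Rome", "Paris"),
  price = c(80, 120, 95),
  stringsAsFactors = FALSE
)

# A column qualifies if it is character and every value is "t" or "f"
is_tf <- vapply(
  toy,
  function(col) is.character(col) && all(col %in% c("t", "f")),
  logical(1)
)

# Convert only the qualifying columns, keeping the same level order as above
toy[is_tf] <- lapply(toy[is_tf], factor, levels = c("t", "f"))

names(toy)[is_tf]
str(toy)
```

Printing names(toy)[is_tf] before converting is a cheap way to confirm that exactly the intended columns were picked up.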
To set up our models, we will be splitting our data, exploring it, creating a recipe, and splitting it once again into cross-validation folds.
To determine how successful our models are, we will be splitting our data: we will use some of it to train our models and we will use the remaining parts to test it. We set our seed (for replicability), make our split ratio 80/20, and split on ratings so we have similar distributions in both training and testing:
set.seed(128)
abnb_split <- initial_split(complete_listings, prop = 0.80,
                            strata = review_scores_rating)
abnb_tr <- training(abnb_split)
abnb_te <- testing(abnb_split)
abnb_split
## <Training/Testing/Total>
## <74658/18666/93324>
In our training set, we will be working with 74,658 Airbnbs!
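To show what splitting "on ratings" means in practice, here is a base-R sketch of stratified sampling on a numeric outcome: bin the outcome into quartiles, then draw 80% of the rows within each bin. This is only an illustration of the idea on made-up data. initial_split() handles stratification internally, and its exact binning may differ from what is shown here.

```r
set.seed(128)

# Hypothetical ratings for 100 made-up listings (all values distinct)
toy <- data.frame(id = 1:100, rating = sample(60 + 0.4 * (0:99)))

# Bin the outcome into quartiles so each bin can be sampled separately
breaks <- quantile(toy$rating, probs = seq(0, 1, by = 0.25))
toy$bin <- cut(toy$rating, breaks = breaks, include.lowest = TRUE)

# Draw 80% of the row indices within each bin
train_idx <- unlist(lapply(split(seq_len(nrow(toy)), toy$bin), function(rows) {
  rows[sample.int(length(rows), size = round(0.8 * length(rows)))]
}))

train <- toy[train_idx, ]
test <- toy[-train_idx, ]

# Both sets should cover the rating range in similar proportions
table(train$bin)
table(test$bin)
```

Because each bin contributes the same share of its rows, the training and testing sets end up with similar rating distributions, which is the property we want from the stratified split.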
Now we finally get to take a look at the shape of our data!
Let’s start by taking a look at the locations of our listings around the world:
# making the map
tr_map <- mapview(abnb_tr, xcol = "longitude", ycol = "latitude",
                  crs = 4326, grid = FALSE)
# save
save(tr_map, file = 'bigFiles/tr_map')
# refining options
options(mapviewMaxPixels = 10000)
mapviewOptions(maxpoints = 2000, maxpolygons = 2000, maxlines = 2000)
# loading
load(file = 'bigFiles/tr_map')
tr_map
It’s clear that we only have listings from a handful of cities. The
cities we have are: Bangkok, Cape Town,
Hong Kong, Istanbul, Mexico City,
New York, Paris, Rio de Janeiro,
Rome, and Sydney.
Let’s see how these cities vary among each other in their rating distributions:
ggplot(abnb_tr, aes(review_scores_rating)) +
  geom_histogram(fill = "indianred2", binwidth = 2) +
  facet_wrap(~city, scales = "free_y") +
  labs(
    title = "Histogram of Reviews by City"
  )
The distributions look pretty similar in each city! Most of the
cities except for Rome show a large spike in perfect
scores.
Here we can look at the overall distribution of ratings among all Airbnbs in our data:
ggplot(abnb_tr, aes(review_scores_rating)) +
  geom_histogram(bins = 60, fill = "indianred2") +
  labs(
    title = "Histogram of Reviews"
  )
The counts rise sharply toward the high end of the rating scale, but we have to keep in mind that we do not know how many total reviews each listing has received! Overall, people tend to be pretty satisfied with their stays, though we do see small spikes in the data which indicate some variation.
Now, let’s look at the distribution of ratings by property type since that may play a large part in what customers are satisfied with:
ggplot(abnb_tr, aes(review_scores_rating)) +
  geom_histogram(fill = "indianred2") +
  facet_wrap(~property_type, scales = "free_y") +
  labs(
    title = "Histogram of Reviews by Property Type"
  )
Most of our graphs seem to be showing similar patterns to each other, though there are subtle differences between them. For example, Airbnbs that are classified as rooms in hotels have a bit of a spike around the 80 and the 90 point review marks. Airbnbs such as entire guest suites show a fairly steady, roughly exponential rise in counts toward the top scores. This makes sense since the two property types are at different price points, so customers are, on average, paying for a better experience.
Speaking of pricing, we can take a look at the correlations between
price and review_scores_rating for each type
of property:
ggplot(abnb_tr, aes(review_scores_rating, price)) +
  geom_point(alpha = 0.1, colour = 'indianred2') +
  geom_smooth(se = FALSE, color = "black", size = 1) +
  facet_wrap(~property_type, scales = "free_y") +
  labs(
    title = "Reviews versus Price by Property Type"
  )
As can be expected, the pricing and the review scores do tend to show a slight positive correlation, though a majority of the bookings stay on the less expensive end. Private rooms in bed and breakfasts and entire guest suites show some unusual patterns, but they are also among the property types with the fewest reviews.
Out of curiosity, I am also going to take a look at the distribution of reviews for a random amenity. I’m personally curious about whether having hot water would or would not affect the rating of an Airbnb:
ggplot(abnb_tr, aes(review_scores_rating)) +
  geom_bar(aes(fill = hot_water)) +
  scale_fill_manual(values = c("indianred2", "skyblue"))
Seems like people don’t mind not having hot water! Or at least they don’t expect to have it if it’s not included within the amenities. Nevertheless, it’s fun to look at!
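Since the two groups in a stacked bar chart can differ a lot in size, a quick numeric check can complement the plot. The sketch below compares the mean rating and the number of listings with and without an amenity, using made-up values. The numbers are hypothetical and say nothing about the real hot water effect.

```r
# Hypothetical toy data; "t"/"f" mirrors how amenities are coded above
toy <- data.frame(
  hot_water = factor(c("t", "t", "t", "f", "f"), levels = c("t", "f")),
  review_scores_rating = c(96, 90, 99, 94, 92)
)

# Mean rating and group size for listings with and without the amenity
group_means <- tapply(toy$review_scores_rating, toy$hot_water, mean)
group_sizes <- table(toy$hot_water)

group_means
group_sizes
```

Looking at the means next to the group sizes makes it easier to tell a real difference from one driven by a handful of listings.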
Finally, let’s take a general look at how much each numeric variable correlates with the others:
# making numeric dataset
numeric_col <-
  abnb_tr %>%
  select_if(is.numeric) %>%
  colnames()
# compute the correlation matrix
numabnb <- select(abnb_tr, all_of(numeric_col)) %>%
  cor()
# corrplot
corrplot(numabnb, is.corr = FALSE, type = "lower", tl.col = 'black')
We do not see strong correlations between most of these variables aside
from review_scores_rating and the component scores that
break it down. These correlations make sense given that the overall review
scores rating is based on the makeup of the other scores! We can also
see that the number of bedrooms correlates with the number of people the
Airbnb can accommodate, which also makes sense. Price seems
to have a slight positive correlation with both of those variables as
well. Finally, we see a correlation between latitude and
longitude (this also makes sense since we are only looking
at a select few cities) as well as a correlation between
host_id and listing_id. Though I’m not sure
about the inner workings of the Airbnb system, it would make sense to
assume that the listing_id is generated using the
host_id.
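To show why an overall score correlates so strongly with its components, here is a small base-R sketch on made-up data. The component column names and all values are hypothetical, and the overall rating is deliberately constructed from the two components, which is an assumption made for illustration and not a claim about how Airbnb computes its ratings.

```r
# Hypothetical toy data with one text column and three numeric columns
toy <- data.frame(
  city = c("Paris", "Rome", "Paris", "Rome", "Paris"),
  review_scores_cleanliness = c(8, 9, 10, 7, 10),
  review_scores_checkin = c(9, 9, 10, 8, 10),
  price = c(120, 80, 95, 60, 150)
)

# Hypothetical overall score built directly from the two component scores
toy$review_scores_rating <- 5 * (toy$review_scores_cleanliness +
                                   toy$review_scores_checkin)

# Keep numeric columns only, then compute pairwise Pearson correlations
num_cols <- vapply(toy, is.numeric, logical(1))
cor_mat <- cor(toy[, num_cols])

round(cor_mat, 2)
```

Any score that is built from other columns will show up as a block of strong correlations like this, which is worth keeping in mind when deciding which variables to use as predictors.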
Now that we know a little bit more about what we’re working with, let’s finish up setting up our models!
In order to build our models, we will create a single recipe that
will tell each model how to use the data we are giving it and what we
are trying to predict: in this case it is
review_scores_rating. Here, I am choosing to exclude a few
of the variables from our data set. I will be excluding:
latitude and longitude since in-between values won’t mean much to us
listing_id and host_id since they are identifiers rather than meaningful predictors
host_since and date since we will be using host_since_year
Then I will be creating dummy variables for all the nominal variables as well as normalizing all the predictors:
abnb_recipe <- recipe(review_scores_rating ~ host_response_time +
                        host_response_rate + host_acceptance_rate +
                        host_is_superhost + host_total_listings_count +
                        host_has_profile_pic + host_identity_verified + city +
                        property_type + room_type + accommodates + bedrooms +
                        price + minimum_nights + maximum_nights +
                        instant_bookable + shampoo + dishes_and_silverware +
                        heating + iron + kitchen + hair_dryer + essentials +
                        washer + bed_linens + refrigerator + hot_water + oven +
                        wifi + cooking_basics + long_term_stays_allowed +
                        dedicated_workspace + elevator + hangers + coffee_maker +
                        carbon_monoxide_alarm + smoke_alarm + other + microwave +
                        air_conditioning + free_street_parking + dryer +
                        fire_extinguisher + extra_pillows_and_blankets + tv +
                        cable_tv + first_aid_kit + private_entrance +
                        luggage_dropoff_allowed + free_parking_on_premises +
                        stove + host_greets_you + patio_or_balcony +
                        host_since_year, data = abnb_tr) %>%
  step_dummy(all_nominal_predictors()) %>%
  step_normalize(all_predictors()) %>%
  prep()
bake(abnb_recipe, new_data = NULL)
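For readers who have not used recipes before, the sketch below shows roughly what the two steps do, using base R on a made-up data frame. model.matrix() and scale() are stand-ins chosen for illustration and the room types and prices are hypothetical. The recipe itself learns its dummy levels and its means and standard deviations from the training data when prep() is called.

```r
# Hypothetical toy data: one nominal predictor and one numeric predictor
toy <- data.frame(
  room_type = factor(c("Entire place", "Private room", "Shared room", "Private room")),
  price = c(100, 60, 30, 50)
)

# Dummy coding: model.matrix() drops the first level as the reference category
dummies <- model.matrix(~ room_type, data = toy)[, -1, drop = FALSE]

# Normalizing: centre each column at 0 and scale it to standard deviation 1
design <- scale(cbind(dummies, price = toy$price))

round(design, 2)
```

Because normalization comes after dummy coding, the 0/1 indicator columns are centred and scaled along with price, which matches the order of the steps in the recipe above.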